Systems and methods for structural clustering of time sequences

ABSTRACT

Arrangements and methods for performing structural clustering between different time series. Time series data relating to a plurality of time series is accepted, structural features relating to the time series data are ascertained, and at least one distance between different time series via employing the structural features is determined. The different time series may be partitioned into clusters based on the at least one distance, and/or the k closest matches to a given time series query based on the at least one distance may be returned.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation application of co-pending U.S. patent application Ser. No. 11/096,485 filed on Mar. 31, 2005, the contents of which are hereby incorporated by reference as if set forth fully herein.

This invention was made with Government support under Contract No.: H98230-04-3-001 awarded by the U.S. Department of Defense. The Government has certain rights in this invention.

FIELD OF THE INVENTION

The present invention generally relates to the management of data relating to time-series representations.

BACKGROUND OF THE INVENTION

Herebelow, numerals set forth in square brackets ([ ]) are keyed to the list of references found towards the end of the present disclosure.

In recent years, in a constant effort to effect ongoing improvements in a crowded field of knowledge, there has been a profusion of time-series distance measures and representations. The majority of these attempts to characterize the similarity between sequences are based solely on shape. However, it is becoming increasingly apparent that structural similarities can provide more intuitive sequence characterizations that adhere more tightly to human perception of similarity.

While shape-based similarity methods seek to identify homomorphic sequences using original raw data, structure-based methodologies are designed to find latent similarities, possibly by transforming the sequences into a new domain, where the resemblance can be more apparent.

Generally, an evolving need has been recognized in connection with providing an ever more effective and efficient manner of managing time-series data.

SUMMARY OF THE INVENTION

Broadly contemplated herein, in accordance with at least one presently preferred embodiment of the present invention, are methods and arrangements considered for:

(i) efficiently capturing and characterizing (automatically) the periodicity of time-series;

(ii) characterizing the periodic similarity of time series; and

(iii) combining the above methods to perform periodic clustering of time-series, where the periodicities of each cluster are also provided.

Techniques such as those outlined above can be applicable in a variety of disciplines, such as manufacturing, natural sciences and medicine, which acquire and record large amounts of periodic data. For the analysis of such data, first there is preferably employed accurate periodicity estimation, which can be utilized either for anomaly detection or for prediction purposes. Then, a structural distance measure can preferably be deployed that can effectively incorporate the periodicity for quantifying the degree of similarity between sequences. It is recognized that a periodic measure can allow for more meaningful and accurate clustering and classification, and can also be used for interactive exploration (and visualization) of massive periodic datasets.

In summary, one aspect of the invention provides a method of performing structural clustering between different time series, said method comprising the steps of: accepting time series data relating to a plurality of time series; ascertaining structural features relating to the time series data; determining at least one distance between different time series via employing the structural features; and partitioning the different time series into clusters based on the at least one distance.

Another aspect of the invention provides an apparatus for performing structural clustering between different time series, said apparatus comprising: an arrangement for accepting time series data relating to a plurality of time series; an arrangement for ascertaining structural features relating to the time series data; an arrangement for determining at least one distance between different time series via employing the structural features; and an arrangement for partitioning the different time series into clusters based on the at least one distance.

A further aspect of the invention provides a program storage device readable by machine, tangibly embodying a program of instructions executed by the machine to perform method steps for performing structural clustering between different time series, said method comprising the steps of: accepting time series data relating to a plurality of time series; ascertaining structural features relating to the time series data; determining at least one distance between different time series via employing the structural features; and partitioning the different time series into clusters based on the at least one distance.

Yet another aspect of the invention provides a method of quantifying the structural similarity between different time series, said method comprising the steps of: accepting time series data relating to a plurality of time series; ascertaining structural features relating to the time series data; determining at least one distance between different time series via employing the structural features; and returning the k closest matches to a given time series query based on the at least one distance.

A yet further aspect of the invention provides an apparatus for quantifying the structural similarity between different time series, said apparatus comprising: an arrangement for accepting time series data relating to a plurality of time series; an arrangement for ascertaining structural features relating to the time series data; an arrangement for determining at least one distance between different time series via employing the structural features; and an arrangement for returning the k closest matches to a given time series query based on the at least one distance.

Furthermore, an additional aspect of the invention provides a program storage device readable by machine, tangibly embodying a program of instructions executed by the machine to perform method steps for quantifying the structural similarity between different time series, said method comprising the steps of: accepting time series data relating to a plurality of time series; ascertaining structural features relating to the time series data; determining at least one distance between different time series via employing the structural features; and returning the k closest matches to a given time series query based on the at least one distance.

For a better understanding of the present invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the invention will be pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a reconstruction of a signal from Fourier coefficients.

FIG. 2 depicts a sequence and a corresponding periodogram and autocorrelation graph.

FIG. 3 schematically depicts an “autoperiod” methodology.

FIG. 4 provides a graphical demonstration of the method of FIG. 3.

FIG. 5 depicts an algorithm, “getPeriodHints”.

FIGS. 6(a) through 6(b) depict queries and corresponding periodograms.

FIG. 7 depicts a segmentation of autocorrelation intervals.

FIGS. 8(a) through 8(d) depict periodicity detection results of the “autoperiod” method.

FIG. 9 provides a comparison between two time-series.

FIG. 10 depicts a dendrogram based on historical features.

FIG. 11 depicts a two-dimensional mapping of pairwise distances between different sequences.

FIG. 12 depicts a dendrogram for a pDist measure, which achieves a perfect clustering.

FIG. 13 depicts incorrect grouping in a 2 class ECG problem.

FIG. 14 depicts correct grouping in a 3 class ECG problem.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

By way of background, provided herebelow is a brief introduction to harmonic analysis using the discrete Fourier Transform, because these tools will be used as the building blocks of algorithms in accordance with at least one embodiment of the present invention.

The normalized Discrete Fourier Transform of a sequence x(n), n = 0, 1, . . . , N−1 is a sequence of complex numbers X(f):

${{X\left( f_{k/N} \right)} = {\frac{1}{\sqrt{N}}{\sum\limits_{n = 0}^{N - 1}{{x(n)}^{- \frac{{j2\pi}\; {kn}}{N}}}}}},{k = {{0,1\mspace{14mu} \ldots \mspace{14mu} N} - 1}}$

where the subscript k/N denotes the frequency that each coefficient captures. Herethroughout there will also be utilized the notation F(x) to describe the Fourier Transform. Since we are dealing with real signals, the Fourier coefficients are symmetric around the middle one (or, to be more exact, they will be the complex conjugates of their symmetric counterparts). The Fourier transform represents the original signal as a linear combination of the complex sinusoids

${s_{f}(n)} = {\frac{^{{j2\pi fn}/N}}{\sqrt{N}}.}$

Therefore, the Fourier coefficients record the amplitude and phase of these sinusoids, after signal x is projected on them.

One can return from the frequency domain back to the time domain using the inverse Fourier transform F⁻¹(X)≡x(n):

${{x(n)} = {\frac{1}{\sqrt{N}}{\sum\limits_{n = 0}^{N - 1}{{X\left( f_{k/N} \right)}^{\frac{{j2\pi}\; {kn}}{N}}}}}},{k = {{0,1\mspace{14mu} \ldots \mspace{14mu} N} - 1}}$

Note that if during this reverse transformation one discards some of the coefficients (e.g., the last k), then the outcome will be an approximation of the original sequence (see FIG. 1). By carefully selecting which coefficients to record, one can perform a variety of tasks such as compression, denoising, etc.

In order to discover potential periodicities of a time-series, one needs to examine its power spectral density (PSD or power spectrum). The PSD essentially tells us the expected signal power at each frequency of the signal. Since period is the inverse of frequency, by identifying the frequencies that carry most of the energy, we can also discover the most dominant periods. There are two well-known estimators of the PSD: the periodogram and the circular autocorrelation. Both of these methods can be computed using the DFT of a sequence (and can therefore exploit the Fast Fourier Transform for execution in O(N log N) time).

Suppose that X is the DFT of a sequence x. The periodogram P is provided by the squared length of each Fourier coefficient:

${P\left( f_{k/N} \right)} = {{{{X\left( f_{k/N} \right)}}^{2}\mspace{14mu} k} = {0,1\mspace{14mu} \ldots \mspace{14mu} \left\lceil \frac{N}{2} \right\rceil}}$

where ∥·∥ denotes the L₂ norm of a vector. Notice that one can only detect frequencies that are at most half of the maximum signal frequency, due to Nyquist's fundamental theorem. In order to find the k dominant periods, one should preferably pick the k largest values of the periodogram.
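As an illustration only, the following sketch (in Python with NumPy; the helper name periodogram and the exact indexing are assumptions, not part of the original disclosure) shows one way the periodogram of a sequence could be computed from its normalized DFT:

```python
import numpy as np

def periodogram(x):
    """Squared magnitudes of the normalized DFT coefficients, kept only up to
    roughly the Nyquist frequency (the first half of the spectrum)."""
    n = len(x)
    X = np.fft.fft(x) / np.sqrt(n)        # normalized DFT
    return np.abs(X[: n // 2 + 1]) ** 2   # P(f_{k/N}) = ||X(f_{k/N})||^2

# The k largest entries of the periodogram (ignoring the DC term at index 0)
# point at the k most dominant periods N/k.
```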

Each element of the periodogram provides the power at frequency k/N or, equivalently, at period N/k. Being more precise, each DFT ‘bin’ corresponds to a range of periods (or frequencies). That is, coefficient X(f_(k/N)) corresponds to periods

$\left\lbrack \frac{N}{k}\ \ldots\ \frac{N}{k - 1} \right).$

It is easy to see that the resolution of the periodogram becomes very coarse for longer periods. For example, for a sequence of length N=256, the DFT bin margins will be N/1, N/2, N/3, . . . = 256, 128, 85.3, etc.

Essentially, the accuracy of the discovered periods deteriorates for large periods, due to the increasing width of the DFT bins (N/k). Another related issue is spectral leakage, which causes frequencies that are not integer multiples of the DFT bin width to disperse over the entire spectrum. This can lead to ‘false alarms’ in the periodogram. However, the periodogram can still provide an accurate indicator of important short (to medium) length periods. Additionally, through the periodogram it is easy to automate the extraction of important periods (peaks) by examining the statistical properties of the Fourier coefficients.

The second way to estimate the dominant periods of a time-series x is to calculate the circular AutoCorrelation Function (or ACF), which examines how similar a sequence is to its previous values for different τ lags:

${{ACF}(\tau)} = {\frac{1}{N}{\sum\limits_{n = 0}^{N - 1}{{x(\tau)} \cdot {x\left( {n + \tau} \right)}}}}$

where the sum in n+τ is modulo N.

Therefore, the autocorrelation is formally a convolution, and one can avoid the quadratic calculation in the time domain by computing it efficiently as a dot product in the frequency domain using the normalized Fourier transform:

ACF=F⁻¹<X,X*>

The star (*) symbol denotes complex conjugation.
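A minimal sketch of this frequency-domain computation (Python with NumPy; the function name circular_acf is an assumption) might look as follows:

```python
import numpy as np

def circular_acf(x):
    """Circular autocorrelation via the FFT: ACF = F^{-1}(X . conj(X)) / N,
    avoiding the O(N^2) time-domain summation."""
    n = len(x)
    X = np.fft.fft(x)
    return np.fft.ifft(X * np.conjugate(X)).real / n
```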

The ACF provides a more fine-grained periodicity detector than the periodogram, hence it can pinpoint with greater accuracy even larger periods. However, it is not sufficient by itself for automatic periodicity discovery for the following reasons:

1. Automated discovery of important peaks is more difficult than in the periodogram, because the user must set a significance threshold.

2. Even if the user picks the level of significance, multiples of the same basic period also appear as peaks. Therefore, the method introduces many false alarms that need to be eliminated in a post-processing phase.

3. Low amplitude events of high frequency may appear less important (i.e., have lower peaks) than high amplitude patterns, which nonetheless occur more rarely (see FIG. 2).

With relation to FIG. 2, the 7 day period is latent in the autocorrelation graph, because it has lower amplitude (even though it happens with higher frequency). However, the 7 day peak is very clear in the periodogram.

The advantages and shortcomings of the periodogram and the ACF are summarized in Table 1.

From the above discussion one can realize that although the periodogram and the autocorrelation cannot provide sufficient spectral information separately, there is a lot of potential when both methods are combined. An approach in accordance with at least one preferred embodiment of the present invention is delineated in the following section.

In accordance with at least one preferred embodiment of the present invention, there is preferably utilized a two-tier approach, by considering the information in both the autocorrelation and the periodogram. One may call this method AUTOPERIOD. Since the discovery of important periods is more difficult on the autocorrelation, one can use the periodogram for extracting period candidates. The period candidates may be termed ‘hints’. These ‘hints’ may be false (due to spectral leakage), or provide a coarse estimate of the period (remember that DFT bins increase gradually in size); therefore a verification phase using the autocorrelation is required, since it provides a more fine-grained estimation of potential periodicities. The intuition is that if the candidate period from the periodogram lies on a hill of the ACF, then one can consider it a valid period; otherwise one may preferably discard it as a false alarm. For the periods that reside on a hill, further refinement may be required if the periodicity hint refers to a large period.

FIG. 3 summarizes a methodology in accordance with at least one embodiment of the present invention and FIG. 4 depicts the visual intuition behind such an approach with a working example. The sequence is obtained from the MSN query request logs and represents the aggregate demand for the query ‘Easter’ for 1000 days after the beginning of 2002. The demand for the specific query peaks during Easter time and one can observe one yearly peak. The intuition is that periodicity should be approximately 365 (although not exactly, since Easter is not celebrated at the same date every year). Indeed the most dominant periodogram estimate is 333.33 (=1000/3), which is located on a hill of the ACF, with a peak at 357 (the correct periodicity, at least for this 3 year span). The remaining periodic hints can be discarded upon verification with the autocorrelation.

FIG. 4 provides a visual demonstration of a method carried out in accordance with an embodiment of the present invention. Candidate periods from the periodogram are verified against the autocorrelation. Valid periods are further refined utilizing the autocorrelation information.

Essentially, there has been leveraged the information of both metrics for providing an accurate periodicity detector. In addition, methods carried out in accordance with at least one embodiment of the present invention are computationally efficient, because both the periodogram and the ACF can be directly computed through the Fast Fourier Transform of the examined sequence in O(N log N) time.

For extracting a set of candidate periodicities from the periodogram, one needs to determine an appropriate power threshold that should distinguish only the dominant frequencies (or inversely the dominant periods). If none of the sequence frequencies exceeds the specific threshold (i.e., the set of periodicity ‘hints’ is empty), then one can regard the sequence as non-periodic.

In order to specify which periods are important, one should preferably first identify how much of the signal energy is attributed to random mechanisms; that is, everything that could not have been attributed to a random process should be of interest.

Let us assume that one examines a sequence x. The outcome of a permutation on the elements of x is a sequence x̃. The new sequence will retain the first order statistics of the original sequence, but will not exhibit any pattern or periodicities, because of the ‘scrambling’ process (even though such characteristics may have existed in sequence x). Anything that has the structure of x̃ is not interesting and should be discarded; therefore at this step one can record the maximum power (p_max) that x̃ exhibits, at any frequency f.

$p_{\max} = \max\limits_{f} \left\| \tilde{x}(f) \right\|^{2}$

Only if a frequency of x has more power than p_max can it be considered interesting. If one would like to provide a 99% confidence interval on which frequencies are important, one should repeat the above experiment 100 times and record for each one the maximum power of the permuted sequence x̃. The 99th percentile of these 100 recorded maximum powers will provide a sufficient estimator of the power threshold p_T being sought. Periods (in the original sequence periodogram) whose power is more than the derived threshold will be considered:

$p_{hint} = \left\{ \frac{N}{k} : P\left( f_{k/N} \right) > p_{T} \right\}$

Finally, an additional period ‘trimming’ should be performed for discarding periods that are either too large or too small and therefore cannot be considered reliable. In this phase any periodic hint greater than N/2 or smaller than 2 is removed.

FIG. 5 presents pseudo-code for the algorithm that identifies periodic hints.
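Since the published pseudo-code itself appears in FIG. 5 and is not reproduced here, the following is only a hedged sketch (Python with NumPy) of the hint-extraction idea described above; the function name, the reuse of the earlier hypothetical periodogram helper, and the percentile indexing are assumptions:

```python
import numpy as np

def get_period_hints(x, n_permutations=100, confidence=0.99):
    """Extract candidate periods: estimate a power threshold p_T from permuted
    ('scrambled') copies of x, then keep every period N/k whose periodogram
    power exceeds p_T, trimming hints larger than N/2 or smaller than 2."""
    n = len(x)
    P = periodogram(x)                    # hypothetical helper from the earlier sketch

    # Maximum periodogram power observed in each permuted (pattern-free) copy.
    max_powers = [periodogram(np.random.permutation(x)).max()
                  for _ in range(n_permutations)]
    p_T = np.sort(max_powers)[int(confidence * n_permutations) - 1]

    hints = [n / k for k in range(1, len(P))   # skip the DC term at k = 0
             if P[k] > p_T and 2 <= n / k <= n / 2]
    return hints, p_T
```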

In [2] another algorithm for detection of important periods was proposed, which follows a different concept for estimating the periodogram threshold. The assumption there was that the periodogram of non-periodic time-series will follow an exponential distribution, which returned very intuitive period estimates for real world datasets. In experimentation, it has been found that the two algorithms return very comparable threshold values. However, because the new method does not make any assumptions about the underlying distribution, it can be applicable for a wider variety of time-series processes.

By way of concrete examples, there were employed sequences from the MSN query logs (yearly span) to demonstrate the usefulness of the discovered periodic hints. In FIG. 6(a) there is presented the demand of the query ‘stock market’, where one can distinguish a strong weekly component in the periodogram. FIG. 6(b) depicts the query ‘weekend’, which does not contain any obvious periodicities. A method in accordance with at least one embodiment of the present invention can set the threshold high enough, thereby avoiding false alarms.

TABLE 1. Concise comparison of approaches for periodicity detection.

  Method            Easy to threshold   Accurate short periods   Accurate large periods   Complexity
  Periodogram       Yes                 yes                      no                       O(N log N)
  Autocorrelation   No                  yes                      yes                      O(N log N)
  Combination       Yes                 yes                      yes                      O(N log N)

After the periodogram peaks have been identified, there has been obtained a candidate set of periodicities for the examined sequence. The validity of these periods will be verified against the autocorrelation. An indication that a period is important can be the fact that the corresponding period lies on a hill of the autocorrelation. If the period resides on a valley then it can be considered spurious and therefore safely discarded.

After discovering that a periodicity ‘hint’ resides on a hill of the autocorrelation, one can refine it even further by identifying the closest peak (i.e., local maximum). This is a necessary step, because the correct periodicity (i.e., peak of the hill) might not have been discovered by the periodogram, if it was derived from a ‘wide’ DFT bin. This is generally true for larger periods, where the resolution of the DFT bins drops significantly. Below is a discussion of how to address such issues.

The significance of a candidate period ideally can be determined by examining the curvature of the ACF around the candidate period p. The autocorrelation is concave downward if the second derivative is negative in an open interval (a . . . b):

$\frac{\partial^{2} {ACF}(x)}{\partial x^{2}} < 0, \quad \text{for all } x \in \left( a \ldots b \right), \quad a < p < b$

Nevertheless, small perturbations of the ACF due to the existence of noise may invalidate the above requirement. There will be sought a more robust estimator of the curvature by approximating the ACF in the proximity of the candidate period with two linear segments. Then it is sufficient to examine if the approximating segments exhibit an upward-downward trend, for identifying a concave downward pattern (i.e., a hill).

The segmentation of a sequence of length N into k linear segments can be computed optimally using a dynamic programming algorithm in O(N²k) time, while a greedy merge algorithm achieves results very close to optimal in O(N log N) time. For this problem instance, however, one can employ a simpler algorithm, because only a two-segment approximation for a specific portion of the ACF is required.

Let Ŝ_a^b be the linear regression of a sequence x between the positions [a . . . b] and

ɛ(Ŝ_(a)^(b))

be the error introduced by the approximating segment. The best split position t_split is derived from the configuration that minimizes the total approximation error:

$t_{split} = \arg \min\limits_{t} \left\lbrack \varepsilon\left( \hat{S}_{1}^{t} \right) + \varepsilon\left( \hat{S}_{t + 1}^{N} \right) \right\rbrack$

After it has been ascertained that a candidate period belongs on a hill and not on a valley of the ACF, there is a need to discover the closest peak, which will return a more accurate estimate of the periodicity hint (particularly for larger periods). One can proceed in two ways: the first one would be to perform any hill-climbing technique, such as gradient ascent, for discovering the local maximum. In this manner the local search will be directed toward the positive direction of the first derivative. Alternatively, one could derive the peak position directly from the linear segmentation of the ACF, which is already computed in the hill detection phase. The peak should be located either at the end of the first segment or at the beginning of the second segment.
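By way of illustration only, a sketch of both steps (the hill test via a two-segment linear fit, followed by peak refinement) could look as follows; the window size w and the helper name are assumptions, not the patented algorithm verbatim:

```python
import numpy as np

def verify_on_hill(acf, p, w=10):
    """Fit two linear segments to the ACF around candidate period p; an
    upward slope followed by a downward slope marks a hill.  The refined
    period is the local maximum within the examined window."""
    lo, hi = max(1, p - w), min(len(acf) - 1, p + w)
    seg = np.asarray(acf[lo:hi + 1], dtype=float)
    best_err, best_slopes = np.inf, (0.0, 0.0)
    for t in range(2, len(seg) - 2):                 # candidate split positions
        xs_l, xs_r = np.arange(t), np.arange(t, len(seg))
        a1, b1 = np.polyfit(xs_l, seg[:t], 1)        # left segment slope/intercept
        a2, b2 = np.polyfit(xs_r, seg[t:], 1)        # right segment slope/intercept
        err = np.sum((a1 * xs_l + b1 - seg[:t]) ** 2) + \
              np.sum((a2 * xs_r + b2 - seg[t:]) ** 2)
        if err < best_err:
            best_err, best_slopes = err, (a1, a2)
    on_hill = best_slopes[0] > 0 and best_slopes[1] < 0
    refined_period = lo + int(np.argmax(seg))        # closest peak of the hill
    return on_hill, refined_period
```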

Both methods have been implemented for the purpose of experimentation and it has been found that both report accurate results.
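Tying the pieces together, a hypothetical AUTOPERIOD driver built from the sketches above (again an assumption, not the patented implementation) could read:

```python
def autoperiod(x):
    """Two-tier sketch: candidate periods from the periodogram threshold are
    verified and refined on the circular autocorrelation."""
    acf = circular_acf(x)
    hints, _ = get_period_hints(x)
    periods = set()
    for p in hints:
        on_hill, refined = verify_on_hill(acf, int(round(p)))
        if on_hill:
            periods.add(refined)
    return sorted(periods)
```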

Several sequences from the MSN query logs were employed to perform convincing experiments regarding the accuracy of a 2-tier methodology in accordance with at least one embodiment of the present invention. The specific dataset is ideal for the present purposes because one can detect a number of different periodicities according to the demand pattern of each query.

The examples in FIG. 8 demonstrate a variety of situations that might occur when using both the periodogram and autocorrelation.

Query ‘Easter’ (MSN): Examining the demand for a period of 1000 days, one can discover several periodic hints above the power threshold in the periodogram. In this example, the autocorrelation information refines the original periodogram hint (from 333 to 357). Additional hints are rejected because they reside on ACF valleys (in the figure only the top 3 candidate periods are displayed for reasons of clarity).

Query ‘Harry Potter’ (MSN): Although the specific query exhibits no observed periodicities (duration 365 days), the periodogram returns 3 periodic hints, which are mostly attributed to the burst pattern during November when the movie was released. The hints are classified as spurious upon verification with the ACF.

Query ‘Fourier’ (MSN): This is an example where the periodogram threshold effectively does not return candidate periods. Notice that if one had utilized only the autocorrelation information, it would have been more troublesome to discover which (if any) periods were important. This represents another validation that the choice to perform the period thresholding in the frequency space was correct.

Economic Index (Stock Market): Finally, this last sequence from a stock market index illustrates a case where both the periodogram and autocorrelation information concur on the single (albeit weak) periodicity.

Through this experimental testbed it has been demonstrated that AUTOPERIOD can provide very accurate periodicity estimates without upsampling the original sequence. In the sections that follow, it will be shown how it can be used in conjunction with periodic similarity measures, for interactive exploration of sequence databases.

Structural measures can preferably be introduced that are based on periodic features extracted from sequences. Periodic distance measures can be used for providing more meaningful structural clustering and visualization of sequences (whether they are periodic or not). After sequences are grouped in ‘periodic’ clusters, using a ‘drill-down’ process the user can selectively apply the AUTOPERIOD method for periodicity estimation on the sequences or clusters of interest. In the discussion of experimentation, examples of this methodology using hierarchical clustering trees are provided.

Let us consider first the utility of periodic distance measures with an example. Suppose that one is examining the similarity between the two time-series of FIG. 9. When sequence A exhibits an upward trend, sequence B displays a downward drift. Clearly, the Euclidean distance (or inner product) between sequences A and B will characterize them as very different. However, if one exploits the frequency content of the sequences and evaluates their periodogram, one will discover that it is almost identical. In this new space, the Euclidean distance can easily identify the sequence similarities. Even though this specific example could have been addressed in the original space using the Dynamic Time Warping (DTW) distance, it should be noted that the methods broadly contemplated herein are significantly more efficient (in terms of both time and space) than DTW. Additionally, periodic measures can address more subtle similarities that DTW cannot capture, such as different patterns/shapes occurring at periodic (possibly non-aligned) intervals. Herebelow, there will be examined cases where the DTW fails.

The new measure of structural similarity presented herein exploits the power content of only the most dominant periods/frequencies. By considering the most powerful frequencies, the present method concentrates on the most important structural characteristics, effectively filtering out the negative influence of noise, and eventually allowing for expedited distance computation. Additionally, the omission of the phase information renders the new similarity measure shift invariant in the time domain. One can therefore discover time-series with similar patterns, which may occur at different chronological instants.

For comparing the periodic structure of two sequences, one should preferably examine how different their harmonic content is. One may achieve this by utilizing the periodogram and, specifically, the frequencies with the highest energy.

Suppose that X is the Fourier transform of a sequence x with length n. One can discover the k largest coefficients of X by computing its periodogram P(X) and recording the positions of the k frequencies with the highest power content (parameter k depends on the desired compression factor). Let us denote the vector holding the positions of the coefficients with the largest power by p⁺ (so p⁺ ⊂ [1 . . . n]). To compare x with any other sequence q, one needs to examine how similar their energies are at the dominant periods of x. Therefore, one preferably evaluates P(Q(p⁺)), which describes a vector holding the coefficients of Q at the same positions as the vector P(X(p⁺)). The distance pDist between these two vectors captures the periodic similarity between sequences x and q:

pDist = ∥P(Q(p⁺)) − P(X(p⁺))∥

Example: Let x and q be two sequences and let their respective Fourier Transforms be X = {(1+2i), (2+2i), (1+i), (5+i)} and Q = {(2+2i), (1+i), (3+i), (1+2i)}. The periodogram vector of X is: P(X) = ∥X∥² = (5, 8, 2, 26). The vector holding the positions of X with highest energy is p⁺ = (2, 4) and therefore P(X(p⁺)) = (0, 8, 0, 26). Finally, since P(Q) = (8, 2, 10, 5), it follows that P(Q(p⁺)) = (0, 2, 0, 5).
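The worked example can be reproduced with a short sketch (Python with NumPy; the function name pdist_measure is an assumption) that operates directly on the two periodogram vectors:

```python
import numpy as np

def pdist_measure(P_x, P_q, k=2):
    """pDist on two periodogram vectors: keep the positions p+ of the k
    strongest powers of P_x and compare the energies at those positions."""
    p_plus = np.argsort(P_x)[-k:]                 # indices of the k largest powers of x
    return np.linalg.norm(P_q[p_plus] - P_x[p_plus])

# Worked example from the text (0-indexed positions):
P_X = np.array([5.0, 8.0, 2.0, 26.0])
P_Q = np.array([8.0, 2.0, 10.0, 5.0])
print(pdist_measure(P_X, P_Q))   # ||(2, 5) - (8, 26)|| = sqrt(36 + 441) ~= 21.84
```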

Alternatively, if one doesn't want to provide a parameter k, one could extract from a sequence those periodic features that retain e% of the signal energy, or use the algorithm from the periodicity detection to extract the most important periods of a sequence.

In order to meaningfully compare the power content of two sequences one should preferably normalize them, so that they contain the same amount of total energy. One can assign to any sequence x(n) unit power, by performing the following normalization:

${\hat{x}(n)} = \frac{{x(n)} - {\frac{1}{N}{\sum\limits_{i = 1}^{N}{x(i)}}}}{\sqrt{\sum\limits_{i = 1}^{N}\left( {{x(n)} - {\frac{1}{N}{\sum\limits_{i = 1}^{N}{x(i)}}}} \right)^{2}}}$

The above transformation will lead to a zero mean value and a sum of squared values equal to 1. Parseval's theorem dictates that the energy in the time domain equals the energy in the frequency domain; therefore the total energy in the frequency domain should also be unity:

∥x̂∥² = ∥F(x̂)∥² = 1

After this normalization, one can more meaningfully compare the periodogram energies.
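A one-function sketch of this normalization (again illustrative only; the function name is an assumption) is:

```python
import numpy as np

def normalize_unit_energy(x):
    """Remove the mean and rescale so the sum of squares is 1; by Parseval's
    theorem the periodogram of the result also carries unit total energy."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return centered / np.sqrt(np.sum(centered ** 2))
```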

Presented herein are the results of extensive experimentation that show the usefulness of the new periodic measures, and the measures are compared with widely used shape-based measures or newly introduced structural distance measures.

Using 16 sequences which record the yearly demand of several keywords at the MSN search engine, one may preferably perform the hierarchical clustering which is shown in FIG. 10. In the dendrogram derived using the pDist as the distance function, one can notice a distinct separation of the sequences/keywords into 3 classes. The first class contains no clear periodicities (no specific pattern in the demand of the query), while the second one exhibits only bursty seasonal trends (e.g., during Christmas). The queries in the final category are requested with high frequency (weekly period) and here one can find keywords such as ‘cinema’, ‘bank’, ‘Bush’, etc.

One can utilize an extended portion of the same dataset for exploring the visualization power of periodic distance measures. Using the pairwise distance matrix between a set of MSN keyword demand sequences (365 values, year 2002), there is evaluated a 2D mapping of the keywords using Multidimensional Scaling (FIG. 11). The derived mapping shows the high discriminatory efficacy of the pDist measure; seasonal trends (low frequencies) are disjoint from periodic patterns (high frequencies), allowing for a more structural sequence exploration. Keywords like ‘fall’, ‘Christmas’, ‘lord of the rings’, ‘Elvis’, etc., manifest mainly seasonal bursts, which need not be aligned in the time axis. On the contrary, queries like ‘dry cleaners’ or ‘Friday’ indicate a natural weekly repeated demand. Finally, some queries do not exhibit any obvious periodicities within a year's time (e.g., ‘icdm’, ‘kdd’, etc.).

For a second experiment there is employed a combination of periodic time series that are collected from natural sciences, medicine and manufacturing, augmented by pairs of random noise and random walk data.

All datasets come in pairs; hence, when performing a hierarchical clustering algorithm on this dataset, one expects to find a direct linkage of each sequence pair at the lower level of the dendrogram. If this happens, one may consider the clustering to be correct. The dataset is made up of 12 pairs; therefore a measure of the clustering accuracy can be the number of correct pair linkages over twelve, the total number of pairs.

FIG. 12 displays the resulting dendrogram for the pDist measure, which achieves a perfect clustering. One can also observe that pairs derived from the same source/process are clustered together as well, in the higher dendrogram level (Power Demand, ECG, MotorCurrent, etc.). After the clustering, one can execute the AUTOPERIOD method and annotate the dendrogram with the important periods of every sequence. Some sequences, like the random walk or the random data, do not contain any periodicities, indicated with an empty set { }. When both sequences at the lower level display the same periodicity, a single set is displayed on the bifurcation for clarity.

For many datasets that come in 2 pairs (power demand, video surveillance, motor current), all 4 instances demonstrated the same basic period (as suggested by the AUTOPERIOD method). However, the periodic measure can effectively separate them into two pairs, because the power content of the respective frequencies was different.

The last experiment is performed on the MIT-BIH Arrhythmia dataset. There are employed two sets of sequences; one with 2 classes of heartbeats and another one with three (FIGS. 13, 14). There is presented the dendrogram of the pDist measure and the DTW, which represents possibly one of the best shape-based distance measures. To tune the single parameter of the DTW (corresponding to the maximum warping length), there were probed several values and here there is reported the one that returned the best clustering.

For both dataset instances, pDist again returns an accurate clustering, while DTW seems to perform badly on the high level dendrogram aggregations, hence not leading to perfect class separation. The Euclidean distance reported worse results. The CDM measure is accurate on the 2 class separation test but does not provide a perfect separation for the 3 class problem (see the original paper [1] for respective results).

By way of recapitulation, there have been presented herein various methods for accurate periodicity estimation and for the characterization of structural periodic similarity between sequences. It is believed that these methods will find many applications for interactive exploration of time-series databases and for classification or anomaly detection of periodic sequences (e.g., in auto manufacturing, biometrics and medical diagnosis).

It is to be understood that the present invention, in accordance with at least one presently preferred embodiment, includes an arrangement for accepting time series data, an arrangement for ascertaining structural features relating to the time series data, and an arrangement for determining at least one distance between different time series and/or an arrangement for returning the k closest matches to a given time series query. Together, these elements may be implemented on at least one general-purpose computer running suitable software programs. They may also be implemented on at least one integrated circuit or part of at least one integrated circuit. Thus, it is to be understood that the invention may be implemented in hardware, software, or a combination of both.

If not otherwise stated herein, it is to be assumed that all patents, patent applications, patent publications and other publications (including web-based publications) mentioned and cited herein are hereby fully incorporated by reference herein as if set forth in their entirety herein.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.

REFERENCES

[1] E. Keogh, S. Lonardi, and A. Ratanamahatana. Towards parameter-free data mining. In Proc. of SIGKDD, 2004.

[2] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identification of Similarities, Periodicities & Bursts for Online Search Queries. In Proc. of SIGMOD, 2004.

1. A method of performing structural clustering between different time series, said method comprising the steps of: accepting distinct and diverse time series data relating to a plurality of time series; ascertaining structural features relating to the time series data; determining at least one distance between different time series via employing the structural features; and partitioning the different time series into time-invariant clusters containing at least one of the time series based on the at least one distance; wherein the clusters are stored in a computer memory.
2. The method according to claim 1, further comprising the step of determining common periodicities corresponding to each of the clusters.
3. The method according to claim 1, further comprising the step of predetermining a number of structural features to compute.
4. The method according to claim 1, wherein: said ascertaining step comprises ascertaining frequency content relating to the time series data; and said ascertaining step further comprises implementing a Discrete Fourier Transform.
5. The method according to claim 1, wherein: said ascertaining step comprises determining an orthogonal transformation relating to the time series data; the orthogonal transformation of the data comprising a Discrete Wavelet Transform.

6. The method according to claim 1, further comprising the steps of: selecting at least one of said structural features; and said determining is performed via employing the at least one structural feature selected.

7. The method according to claim 6, wherein said selecting step is performed by a user.
8. The method according to claim 6, wherein: said selecting step is performed automatically; and said selecting step comprises: identifying candidate structural features; and verifying the candidate structural features.
9. The method according to claim 8, wherein: said identifying step comprises: computing a periodogram of the different time series; and identifying peaks of the periodogram; and said verifying step comprises: computing an autocorrelation; and selecting identified peaks of the periodogram that lie on hills of the autocorrelation.
10. The method according to claim 1 wherein said ascertaining step comprises: computing all structural features; and automatically selecting a number of most relevant features.
11. The method according to claim 10, wherein said step of automatically selecting a number of most relevant features comprises: selecting a parameter k corresponding to a number of structural features to keep; and retaining the k features that contain the highest amount of periodic content.
12. The method according to claim 10, wherein: said step of automatically selecting a number of most relevant features comprises: selecting a threshold; and retaining features having value larger than the threshold; and said step of selecting a threshold comprises selecting a threshold which serves to discard features having values attributable to statistical variations via: computing a resampling estimate of the distribution of feature values attributable to statistical variations; selecting a value of probability of type 1 error; and selecting as a threshold a value that guarantees the selected value of probability of type 1 error for a distribution equal to the resampling estimate of the distribution.
13. An apparatus for performing structural clustering between different time series, said apparatus comprising: an arrangement for accepting distinct and diverse time series data relating to a plurality of time series; an arrangement for ascertaining structural features relating to the time series data; an arrangement for determining at least one distance between different time series via employing the structural features; and an arrangement for partitioning the different time series into time-invariant clusters containing at least one of the time series based on the at least one distance; wherein the clusters are stored in a computer memory.
14. The apparatus according to claim 13, further comprising an arrangement for determining common periodicities corresponding to each of the clusters.

15. The apparatus according to claim 13, further comprising an arrangement for predetermining a number of structural features to compute.
16. The apparatus according to claim 13, wherein: said ascertaining arrangement is adapted to ascertain frequency content relating to the time series data; and said ascertaining arrangement is further adapted to implement a Discrete Fourier Transform.
17. The apparatus according to claim 13, wherein: said ascertaining arrangement is adapted to determine an orthogonal transformation relating to the time series data; the orthogonal transformation of the data comprising a Discrete Wavelet Transform.
18. The apparatus according to claim 13, further comprising: an arrangement for selecting at least one of said structural features; and said determining arrangement is adapted to employ the at least one structural feature selected.
19. The apparatus according to claim 18, wherein said selecting arrangement is operable by a user.
20. The apparatus according to claim 18, wherein: said selecting arrangement is operable automatically; and said selecting arrangement is adapted to: identify candidate structural features; and verify the candidate structural features.
21. The apparatus according to claim 20, wherein: said identifying arrangement is adapted to: compute a periodogram of the different time series; and identify peaks of the periodogram; and said verifying arrangement is adapted to: compute an autocorrelation; and select identified peaks of the periodogram that lie on hills of the autocorrelation.
22. The apparatus according to claim 13 wherein said ascertaining arrangement is adapted to: compute all structural features; and automatically select a number of most relevant features.
23. The apparatus according to claim 22, wherein said arrangement for automatically selecting a number of most relevant features is adapted to: select a parameter k corresponding to a number of structural features to keep; and retain the k features that contain the highest amount of periodic content.
24. The apparatus according to claim 22, wherein: said arrangement for automatically selecting a number of most relevant features is adapted to: select a threshold; and retain features having value larger than the threshold; and said arrangement for selecting a threshold is adapted to select a threshold which serves to discard features having values attributable to statistical variations via: computing a resampling estimate of the distribution of feature values attributable to statistical variations; selecting a value of probability of type 1 error; and selecting as a threshold a value that guarantees the selected value of probability of type 1 error for a distribution equal to the resampling estimate of the distribution.
25. A program storage device readable by machine, tangibly embodying a program of instructions executed by the machine to perform method steps for performing structural clustering between different time series, said method comprising the steps of: accepting distinct and diverse time series data relating to a plurality of time series; ascertaining structural features relating to the time series data; determining at least one distance between different time series via employing the structural features; and partitioning the different time series into time-invariant clusters containing at least one of the time series based on the at least one distance; wherein the clusters are stored in a computer memory.
26. A method of quantifying the structural similarity between different time series, said method comprising the steps of: accepting distinct and diverse time series data relating to a plurality of time series; ascertaining structural features relating to the time series data; determining at least one distance between different time series via employing the structural features; and returning the k closest matches to a given time series query based on the at least one distance; wherein the k closest matches are stored in a computer memory.

27. The method according to claim 26, wherein the structural features are based on at least one of: periodic features extracted from the time-series; and burst features extracted from the time-series.
28. An apparatus for quantifying the structural similarity between different time series, said apparatus comprising: an arrangement for accepting distinct and diverse time series data relating to a plurality of time series; an arrangement for ascertaining structural features relating to the time series data; an arrangement for determining at least one distance between different time series via employing the structural features; and an arrangement for returning the k closest matches to a given time series query based on the at least one distance; wherein the k closest matches are stored in a computer memory.
29. The apparatus according to claim 28, wherein the structural features are based on at least one of: periodic features extracted from the time-series; and burst features extracted from the time-series.